The value around which the distribution is centered
- Mean
- Median
- Mode (most common value)
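Base R has no built-in function for the statistical mode (`mode()` returns an object's storage type, not the most common value). A minimal sketch of one way to compute it; the function name `sample_mode` is our own:

```r
# Most common value in a vector (first value wins ties).
# Note: base R's mode() returns the storage mode, not this.
sample_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

sample_mode(c(11, 63, 22, 10, 11, 11, 58))
## [1] 11
```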
Sample (arithmetic) mean:
\[\bar{Y} = \frac{\sum^n_{i=1}Y_i}{n}\]
The term “mean” is preferred to “average.” The arithmetic mean is one kind of average, but there are others (as there are other types of means).
The median is the central measurement of a sample (the 50th percentile and the 0.50 quantile). If \(n\) is even, then the median is the mean of the two middle observations.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
median(1:10)
## [1] 5.5
quantile(1:10, prob = 0.5)
## 50% 
## 5.5
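For even \(n\), this agrees with the definition above: the median of 1 through 10 is the mean of the two middle observations, 5 and 6:

```r
# Median for even n = mean of the two middle observations
mean(c(5, 6))
## [1] 5.5
```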
Number of lateral plates (plates) in threespine sticklebacks (Gasterosteus aculeatus) with three different Ectodysplasin genotypes (mm, Mm, and MM).
glimpse(SticklebackPlates)
## Rows: 344
## Columns: 2
## $ genotype <chr> "mm", "Mm", "Mm", "Mm", "mm", "mm", "Mm", "Mm", "Mm", "Mm", …
## $ plates   <dbl> 11, 63, 22, 10, 14, 11, 58, 36, 31, 61, 63, 8, 12, 11, 64, 6…
SticklebackPlates %>% 
  group_by(genotype) %>% 
  summarize(mean_plate = mean(plates),
            median_plate = median(plates),
            .groups = "drop")
## # A tibble: 3 x 3
##   genotype mean_plate median_plate
## * <chr>         <dbl>        <dbl>
## 1 mm             11.7           11
## 2 Mm             50.4           59
## 3 MM             62.8           63
Mean undulation rate for \(n = 8\) gliding snakes:
undulation_rate <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)
What is the mean undulation rate for this sample of flying snakes?
Arithmetic mean:
\[\bar{Y} = \frac{\sum_{i=1}^{n}Y_i}{n}\]
\[mean~undulation~rate = \frac{\sum_{i=1}^{n}undulation~rate_i}{n}\]
sum(undulation_rate) / length(undulation_rate)
## [1] 1.375
mean(undulation_rate)
## [1] 1.375
Use dnorm() to calculate the relative likelihood of an observed value \(Y_i\) drawn from a normal distribution given a mean (\(\mu\)) and standard deviation (\(\sigma\)).
\[f\left(Y_i; \mu, \sigma\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{\frac{-\left(Y_i-\mu\right)^{2}}{2\sigma^{2}}}\]
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
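As a check, the density formula above can be evaluated by hand; at \(Y_i = 0\) with \(\mu = 0\) and \(\sigma = 1\), it reduces to \(1/\sqrt{2\pi}\):

```r
# Normal density evaluated directly from the formula above
mu <- 0
sigma <- 1
y <- 0
(1 / sqrt(2 * pi * sigma^2)) * exp(-(y - mu)^2 / (2 * sigma^2))
## [1] 0.3989423
```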
Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?
For a set of observations (\(Y_i\)) and hypothesized parameters (\(\Theta\); i.e., mean and standard deviation) the model likelihood is the product of the observations’ individual likelihoods:
\[\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right) = \prod_{i=1}^{n} Pr\left[Y_{i}; \Theta\right]\]
\[\log\left(\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right)\right) = \sum_{i=1}^{n} \log\left(Pr\left[Y_{i};\Theta\right]\right)\]
Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?
Likelihood for the first observation (undulation_rate[1]):
undulation_rate[1]
## [1] 0.9
dnorm(undulation_rate[1], mean = 0, sd = 1)
## [1] 0.2660852
This is only the likelihood for one observation. We need the likelihoods for all 8 undulation rates to get a model likelihood.
Vector of likelihoods for all values in undulation_rate given mu = 0 and sigma = 1:
(rel_liks <- dnorm(undulation_rate, mean = 0, sd = 1))
## [1] 0.26608525 0.19418605 0.19418605 0.17136859 0.14972747 0.14972747 0.11092083
## [8] 0.05399097
Model likelihood is the product of those likelihoods:
(lik <- prod(rel_liks))
## [1] 2.308476e-07
log(lik)
## [1] -15.28151
Rather than logging the product, we can sum the log-likelihoods:
sum(log(rel_liks))
## [1] -15.28151
For a model in which the mean is 0 and standard deviation is 1, the model log-likelihood is -15.28.
Is there another combination of \(\mu\) and \(\sigma\) that gives a higher likelihood (= larger log-likelihood)?
Try \(\mu = 1\) and \(\sigma = 1\):
sum(log(dnorm(undulation_rate, mean = 1, sd = 1)))
## [1] -8.281508
This is an improvement over \(\mu = 0\) and \(\sigma = 1\).
Find the combination of \(\mu\) and \(\sigma\) that maximizes the log-likelihood of the model for the mean and standard deviation of undulation rates.
Ranges of possible values:
For combinations of \(\mu\) and \(\sigma\), calculate the model likelihood. Pick the largest log-likelihood as the parameter estimates.
Set up the grid:
n <- 100                            # How fine is the grid?
mu <- seq(0.1, 3, length = n)       # Vector of mu
sigma <- seq(0.1, 0.5, length = n)  # Vector of sigma
grid_approx <- crossing(mu, sigma)
grid_approx
## # A tibble: 10,000 x 2
##       mu sigma
##    <dbl> <dbl>
##  1   0.1 0.1  
##  2   0.1 0.104
##  3   0.1 0.108
##  4   0.1 0.112
##  5   0.1 0.116
##  6   0.1 0.120
##  7   0.1 0.124
##  8   0.1 0.128
##  9   0.1 0.132
## 10   0.1 0.136
## # … with 9,990 more rows
log_lik <- numeric(length = nrow(grid_approx))
for (ii in 1:nrow(grid_approx)) {
log_lik[ii] <-
sum(dnorm(undulation_rate,
mean = grid_approx$mu[ii],
sd = grid_approx$sigma[ii],
log = TRUE))
}
grid_approx$log_lik <- log_lik
grid_approx now maps each combination of mu and sigma to its log_lik:
grid_approx
## # A tibble: 10,000 x 3
##       mu sigma log_lik
##    <dbl> <dbl>   <dbl>
##  1   0.1 0.1     -676.
##  2   0.1 0.104   -624.
##  3   0.1 0.108   -578.
##  4   0.1 0.112   -536.
##  5   0.1 0.116   -499.
##  6   0.1 0.120   -466.
##  7   0.1 0.124   -436.
##  8   0.1 0.128   -408.
##  9   0.1 0.132   -384.
## 10   0.1 0.136   -361.
## # … with 9,990 more rows
On this grid, the maximum likelihood estimates of \(\mu\) and \(\sigma\) are:
grid_approx[which.max(grid_approx$log_lik), ]
##            mu     sigma   log_lik
## 4451 1.388889 0.3020202 -1.810766
The analytical estimates are:
mean(undulation_rate)
## [1] 1.375
sd(undulation_rate)
## [1] 0.324037
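Note that the grid estimate of \(\sigma\) (about 0.302) falls below `sd()` (about 0.324): the maximum likelihood estimate of \(\sigma\) divides the sum of squares by \(n\), while `sd()` divides by \(n - 1\). A quick check of the \(n\)-denominator version:

```r
undulation_rate <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)
n <- length(undulation_rate)

# ML estimate of sigma divides by n, not n - 1
sqrt(sum((undulation_rate - mean(undulation_rate))^2) / n)
## [1] 0.3031089

# Equivalent rescaling of sd()
sd(undulation_rate) * sqrt((n - 1) / n)
## [1] 0.3031089
```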
Search for the most likely values of \(\mu\) and \(\sigma\) across all possible values.
Define a function that takes a vector x of parameters to optimize (\(\mu\) and \(\sigma\)) and a set of data Y, and returns the log-likelihood:
log_lik <- function(x, Y){
liks <- dnorm(Y, mean = x[1], sd = x[2], log = TRUE)
return(sum(liks))
}
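As a quick check, calling this function with the hypothesized values \(\mu = 0\) and \(\sigma = 1\) reproduces the model log-likelihood computed above:

```r
# Reproduces the log-likelihood for mu = 0, sigma = 1 from above
log_lik(c(0, 1), Y = undulation_rate)
## [1] -15.28151
```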
We can now simultaneously optimize \(\mu\) and \(\sigma\), maximizing the log-likelihood.
reltol tells optim() to stop when the relative improvement is \(<10^{-100}\).
optim(c(0.1, 0.1), # Start at 0.1, 0.1
log_lik,
Y = undulation_rate,
control = list(fnscale = -1,
reltol = 10^-100))
## $par
## [1] 1.3750000 0.3031089
## 
## $value
## [1] -1.802203
## 
## $counts
## function gradient 
##      175       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
glm() fits generalized linear models via optimization:
fm <- glm(undulation_rate ~ 1)  # Estimate a mean only
coef(fm)
## (Intercept) ## 1.375
logLik(fm)
## 'log Lik.' -1.802203 (df=2)
For a small enough tolerance, the maximum likelihood estimate equals the analytical estimate.